ABSTRACT

It is generally assumed that test administrators are 
accurate and dependable, and that the psychometric properties of 
validity and reliability applied to test givers are at acceptably 
high levels. The test giver is thought to have been standardized 
through training reinforced by experience. This paper considers 
validity and reliability in relation to test givers regarded as 
instruments of measurement. The test giver may be regarded as part of 
the instrument he or she uses, or the giver may be seen as the master 
instrument in charge of the others. It must be acknowledged that 
there are differences among test administrators. To ensure the best 
assessment by the test giver as instrument of assessment, the 
following must be addressed: (1) acknowledging that the giver is a 
person; (2) not reviewing a child's records before the assessment; 
(3) referring to other reports before drafting one's own; (4) talking 
to other test givers regularly, particularly about scoring; (5) 
providing training on the issue of behavioral observations; and (6) 
ensuring that test developers acknowledge the role of the test giver. 
Two tables illustrate the discussion. (Contains 38 references.)
(SLD) 
Remodeling Our View of Assessment: The Test Giver as Instrument

Janet F. Carlson



The article invokes a literal image of Test Givers as measurement devices, and 
explores the psychometric properties of these instruments. Criterion and 
content validity are described, followed by test-retest, parallel forms, and 
internal consistency reliability. Recommendations for improving Test Givers'
psychometric properties are offered.



In many ways, we assume that those who
administer tests are accurate and dependable. In
essence, we assume the psychometric properties of
validity and reliability, as applied to Test Givers, are at
acceptably high levels. For the most part, these
properties are thought to have been established largely
by virtue of one's graduate training. In addition,
incremental gains in the Test Giver's accuracy and
consistency are assumed to occur during one's
internship, professional experiences, continuing
educational experiences, and ongoing exposure to
supervision such as might occur in peer review
processes.

Virtually all graduate programs that train Test 
Givers attempt to make them uniform. Those who 
have taught in graduate-degree programs and, perhaps,
have taught courses in testing will recognize that
trainers do not encourage diversity when it comes to
learning to administer a standardized test. In fact, we
emphasize the opposite: uniformity. There is a sense
that Test Givers should be interchangeable. In essence,
the Test Giver is thought to have been "standardized"
during his or her training, and this standardization is 
reinforced in subsequent experiences. 

However, given the variability of training 
programs, and the variability of instructors in courses 
relating to test giving, it seems unreasonable to expect 
such a high degree of sameness among Test Givers. 
Uniformity may characterize Test Givers in a
particular program or in a particular course, but it is
unlikely to hold across instructors and across graduate
programs. In this light, let us reflect upon the
psychometric properties of validity and reliability, by
framing these formal issues in relation to the
instruments under consideration now: Test Givers.

Validity 

In short, the validity of an instrument reflects the
extent to which it measures what it is intended to
measure. Most of our graduate programs trained us
well in terms of gathering information that answers the
referral question. But just how does a Test Giver
become accurate or valid in the first place? Most
likely, the process is initiated during specific graduate
courses and training experiences. So the beginning
Test Giver takes a course in administering intelligence
tests, and is instructed on how to do this accurately.
Achieving accuracy is often equated with rigorous
instruction on the "how to's" of test administration,
scoring, and interpretation.

But many Test Givers who conduct assessments as
part of their everyday professional activities have
encountered unique answers to test questions, answers
that do not appear (even remotely) in the test manual.
Even after one's administration and scoring have been
"standardized" by graduate training, such occurrences
are not uncommon. In an attempt to limit the impact
of such events, graduate training tends to emphasize
the overarching principles of scoring over the specific
responses. So students learn to score by considering
the general guidelines rather than the actual words or
phrases used by a test taker to answer a given question.
However, even straightforward subtests, such as
Information, can be problematic when they produce
curious responses. Theoretically in this subtest, as in
several others, there is one correct response to each
question, with an occasional second or third option that
receives credit as well. Follow-up prompts or
questions are delineated clearly in the test manual. But
many of us would question what to say in response to
the child who says "Fred" when asked, "What do we
call a baby cow?" We might be especially interested in
probing the response, or perhaps even giving credit for
it, if the response came from an inner city child of
limited means, who has had little experience with rural
terms for baby farm animals.

Many graduate programs emphasize the
administration, scoring, and interpretation of tests.
Most also attempt to address the underlying constructs
at issue, constructs such as intelligence. Of the many
things traditionally covered in courses on test giving, 
administration and scoring appear to be the most 
simple, task-oriented components of the process. So a 
logical question might be: How successful are 
graduate programs in teaching these sorts of skills? 

Many studies have demonstrated that graduate
students and professionals alike commit numerous
errors both in administration and in scoring of
standardized assessment instruments (e.g., Blakely,
Fantuzzo, Gorsuch, & Moon, 1987; Brannigan, 1975;
Conner & Woodall, 1983; Franklin, Stillman,
Burpeau, & Sabers, 1982; Hanna, Bradley, & Holen,
1981; Miller & Chansky, 1972; Moon, Blakely,
Gorsuch, & Fantuzzo, 1991; Moon, Fantuzzo, &
Gorsuch, 1986; Sherrets, Gard, & Langner, 1979; Slate
& Chick, 1989; Thompson & Bulow, 1994; Warren &
Brown, 1972). Much of the research in this area has
focused upon intelligence tests, and the implications
have addressed such factors as accuracy and the effects
on placement decisions that follow from such mistakes.
Slate and Hunnicutt (1988) reviewed the literature on
Wechsler scoring errors and suggested several factors
that might account for the departures that are so widely
noted, including carelessness on the part of the Test
Giver and poor instruction on the part of the trainer.



Table 1

WISC-R Subtest and Composite Scores Assigned by Different Scorers for Same Subject

         Verbal                              Performance                         Composites
Scorer   Inf   Sim   Ari   Voc   Com   DSp   PC    PA    BD    OA    Cdg   Mz    VIQ    PIQ    FSIQ

1        12    15    10    13    13    10    14    14    13    13    8     12    115    117    118
2        12    19    10    13    14    7     14    14    13    11    8     7     122    114    121
3        12    15    10    13    13    10    14    14    13    11    8     12    115    114    117
4        12    15    10    12    12    10    9     14    13    11    8     8     113    106    111
5        12    15    10    14    14    10    13    14    13    11    8     10    118    112    118
6        12    18    10    12    13    10    14    14    13    11    8     12    118    114    118
7        12    16    10    13    13    10    14    14    13    11    8     12    117    114    118
8        12    18    10    13    13    10    14    14    14    11    8     12    118    115    120
9        12    19    10    12    14    9     14    14    13    11    8     12    120    114    120
10       12    16    10    13    13    10    14    14    13    11    8     12    117    114    118
11       12    16    10    12    12    5     14    14    13    11    8     12    114    114    116
12       12    16    10    12    14    10    14    14    13    11    8     12    117    114    118
13       12    16    10    12    13    10    14    14    13    11    8     12    115    114    117
14       12    16    10    12    13    4     14    14    13    11    8     12    115    114    117
15       12    17    10    12    14    10    14    14    14    11    8     12    118    115    119

Mean     12.0  16.5  10.0  12.5  13.2  9.0   13.6  14.0  13.1  11.1  8.0   11.3  116.8  113.7  117.7
s.d.     0.0   1.41  0.0   0.64  0.68  2.0   1.30  0.0   0.35  0.52  0.0   1.62  2.37   2.35   2.28

Expert   12    15    10    13    13    10    14    14    13    11    8     12    115    114    117
%agr     100   26.7  100   40.0  53.3  73.3  86.7  100   86.7  93.3  100   80.0  26.7   66.7   20.0
%agr ±1  100   66.7  100   100   100   80.0  93.3  100   100   93.3  100   80.0  33.3   80.0   66.7



In a subsequent empirical article, Slate, Jones, and
Murray (1991) note that practice administrations of
Wechsler tests merely permitted students in training to
practice their errors rather than to improve their
proficiency. They also note, as have others, that
Verbal subtests are particularly prone to examiner
errors. Predictably, the greatest number of errors
on Wechsler tests occurred for Vocabulary,
Comprehension, and Similarities, followed by Picture
Completion and Information. The ten most common
types of errors made were (1) a failure to record
something (response or time), (2) assigning too many
points, (3) a failure to question a response, (4)
questioning inappropriately, (5) assigning too few
points, (6) incorrect conversion of raw scores to
standard scores, (7) failure to obtain a "ceiling," (8)
failure to assign points correctly on Performance items,
(9) incorrect raw score for subtest total (a math error),
and (10) incorrect calculation of chronological age
(another math error).



Table 2 

Sample of Behavioral Observations Made by Different Scorers for Same Subject 



Scorer   Observations

3        V. remained attentive throughout the testing procedures. She was cooperative in answering
         test items but offered little spontaneous speech. V. appeared concerned with her performance.
         She was especially persistent and worked deliberately on the performance subtests. On some
         of the verbal subtest items, she clearly stated her lack of knowledge (e.g., "We don't study
         things like that," "I have no idea"). Throughout testing, V. was frequently moving her left
         foot underneath the table.

8        V. did not fidget in her chair. She seemed comfortable and often smiled at the examiner
         revealing that there was a nice rapport established. She also helped the examiner in some
         cases which also showed her comfortness and patience. V. had little to say throughout the
         test and appeared to be quite confident in her answers, and sure that she had never heard
         some of the words before and therefore just did not know the answers to them.

10       V. was very cooperative during testing. She helped put the materials away on object
         assembly. She followed directions. She was also rather quiet when not asked for a response.
         V. was clicking her heels during a few of the subtests. In digit span, she mouthed the
         numbers and was playing with her hair. V. tended to fix the cards in picture arrangement
         not the blocks in block design when she was finished.

11       V. seemed anxious at times. Her foot was rocking underneath the table a lot during testing.
         Also she would cover her face, giggle and looked down when she didn't know something,
         especially the verbal subtests. However, generally, she seemed confident and enjoyed
         the tasks. She concentrated well and was careful with her work. She was also pleasant and
         cooperative. She stated that the mazes and blocks and a few of the puzzles were hard for her.

13       V. is a white girl of average height and weight. She appeared neat and well groomed at the
         time of testing. V. spoke softly throughout the testing and spoke with moderate affect. During
         testing, she sat with her arms folded in front of her on the table. She also sat up
         straight in her chair without slouching. While V. showed little undue anxiety, she did
         demonstrate that she was taking the testing situation seriously, as described by her behaviors
         above. She worked quietly on most tasks with little verbalization other than what she was
         asked to contribute. Overall, she was cooperative and followed directions given to her.

In a small-scale empirical investigation of my own,
I examined beginning examiners' scoring errors. My
exploration differed from others in that I limited the
scope to include only scoring errors by presenting
students with a videotaped administration of a WISC-R.
Thus, all the students had to do was record and
score responses. Because they did not need to be
concerned with querying, setting up test materials, or
obtaining proper basals and ceilings, they could not
make these kinds of mistakes. None of the 15 students
in this small study made math errors, but they did
make all the rest. The numerical results of this study
are displayed in Table 1.

Also of considerable concern is what has been left
out in research of this kind. Traditionally, the research
has looked at score accuracy somewhat and at
competence in administration. Behavioral
observations are notoriously absent from consideration,
with but a few exceptions (e.g., Glutting, Oakland, &
McDermott, 1989; Kaplan, 1991). It seems another
assumption is made regarding Test Givers: they have
an innate capacity to observe behavior accurately and
need little instruction or guidance on these tasks.

Certainly, it cannot be that behavioral observations
are unimportant. Indeed, authors on this subject nearly
always note that "behavioral observations" are part of
the formal report (Ownby, 1987; Ross-Reynolds, 1990;
Tallent, 1981; Zuckerman, 1989). Some have made
specific suggestions about which behaviors to include
in this section of the report. One practitioner and
internship supervisor I know routinely notes that the
behavioral observations section of a formal report is
both the most important section and the most difficult
section to write.

Just how accurate are Test Givers in their
observations of behavior? Table 2 contains some of the
behavioral observations made by students about the
videotaped subject used in my study. Although the
descriptions are all of the same child, the differences
are apparent, and leave us questioning how to improve
the accuracy and/or standardization of this important
part of assessment. Errors made in observing behavior
may be more difficult to address than errors of
administration and scoring, because they are more
vague, more elusive and more open to subjective
interference. And although some research has been
directed at identifying scoring and administration
errors, relatively little has appeared with a focus on
behavioral observations, as noted earlier.

A view of criterion validity can be had by
considering the information presented in Table 1. If
each individual Test Giver's scoring pattern is
compared to that produced by the panel of experts,
the comparison yields a kind of criterion validity.
Here, the accepted criterion measure to which
individual results (i.e., the scores assigned by
individual Test Givers) are compared is the score
profile produced by the expert panel. The extent to
which an individual Test Giver's scores agree with those
of the expert panel yields a measure of criterion
validity for each individual Test Giver.
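
The comparison just described can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: it takes the twelve subtest scores each scorer assigned in Table 1, compares them with the expert panel's profile, and reports the share of subtests on which each Test Giver matched the panel exactly. The names `EXPERT`, `SCORERS`, and `agreement_with_expert` are invented for this example, and the few cells that are hard to read in the printed table are filled with the values implied by that row's composite scores, which is an assumption.

```python
# Subtest order: Inf, Sim, Ari, Voc, Com, DSp, PC, PA, BD, OA, Cdg, Mz
EXPERT = [12, 15, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12]

# Scaled scores assigned by scorers 1-15 (Table 1).
# Assumption: illegible cells are inferred from the row's composite scores.
SCORERS = {
    1:  [12, 15, 10, 13, 13, 10, 14, 14, 13, 13, 8, 12],
    2:  [12, 19, 10, 13, 14, 7, 14, 14, 13, 11, 8, 7],
    3:  [12, 15, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    4:  [12, 15, 10, 12, 12, 10, 9, 14, 13, 11, 8, 8],
    5:  [12, 15, 10, 14, 14, 10, 13, 14, 13, 11, 8, 10],
    6:  [12, 18, 10, 12, 13, 10, 14, 14, 13, 11, 8, 12],
    7:  [12, 16, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    8:  [12, 18, 10, 13, 13, 10, 14, 14, 14, 11, 8, 12],
    9:  [12, 19, 10, 12, 14, 9, 14, 14, 13, 11, 8, 12],
    10: [12, 16, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    11: [12, 16, 10, 12, 12, 5, 14, 14, 13, 11, 8, 12],
    12: [12, 16, 10, 12, 14, 10, 14, 14, 13, 11, 8, 12],
    13: [12, 16, 10, 12, 13, 10, 14, 14, 13, 11, 8, 12],
    14: [12, 16, 10, 12, 13, 4, 14, 14, 13, 11, 8, 12],
    15: [12, 17, 10, 12, 14, 10, 14, 14, 14, 11, 8, 12],
}


def agreement_with_expert(scores, expert=EXPERT):
    """Return the proportion of subtests scored exactly as the expert panel did."""
    matches = sum(1 for s, e in zip(scores, expert) if s == e)
    return matches / len(expert)


for scorer, scores in SCORERS.items():
    print(f"Scorer {scorer:2d}: {agreement_with_expert(scores):.0%} exact agreement")
```

Exact agreement is the strictest possible index. A looser one, such as agreement within one scaled-score point, could be obtained by replacing `s == e` with `abs(s - e) <= 1`.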

Alternatively, Test Givers can be imagined as a 
group of items, with each item representing an 
individual Test Giver. That is, it is possible to think 
of all Test Givers collectively as a single instrument,
because we shape Test Givers in groups and try to train 
good "troops" of psychologists or counselors or 
whatever, as far as their test giving is concerned. At 
least for the ensuing discussion regarding content 
validation, it is helpful to think in these terms. 

One obvious question that follows from this view of
Test Givers and the content validity issue under
discussion has to do with the adequacy of the sample.
If Test Givers are seen collectively as an instrument of
assessment, and if each individual Test Giver is
regarded as an item of that instrument, one might
question how well the items represent the domain of
interest. Arguably, the domain of interest is "human
beings" or, more narrowly for our purposes, citizens of
the United States. Clearly, the domain should be
inclusive rather than exclusive, as it would not be
desirable to have the collection of Test Givers exclude,
in whole or in part, identifiable segments of the general
populace. So the content validity question becomes:
To what extent does the collection of Test Givers
reflect the U.S. citizenry? The answer: To a limited
extent, at best, given what is known about the
demographic characteristics of the various professions
concerned with assessment. The underrepresentation
of most minority groups, bilingual persons, and
persons from lower socioeconomic and disadvantaged
backgrounds speaks to this issue. We must admit that
content validity, as defined above, is weak.

Relatedly, questions may be raised about the
process of item development. That is, thinking as one
does when developing a traditional instrument, a test
developer must be concerned with the characteristics of
the pool of items. The process is analogous to that of
traditional item development. We start with a pool of
items, more than we plan to retain. Items are
eliminated on the basis of poor performance. In the
same way, graduate schools start with more potential
Test Givers than will actually complete their training.
Along the way, some of these are eliminated for
reasons including poor performance. Most graduate
schools have attempted to improve the
representativeness of the item pool by active efforts to
recruit and retain underrepresented Test Givers. Some
have succeeded in these efforts; others have attempted
to remedy pool problems by providing additional
training or requiring additional course work in
multiculturalism or multicultural service delivery and
so forth.

Reliability 

Reliability refers to the dependability of test scores;
that is, their consistency. Essentially, reliability 
reflects the confidence one can have that test scores 
will remain the same across time, across persons (that 
is, scorers), across versions of the same instrument, 
and across portions of the test itself. If Test Givers are 
instruments, too, then they are expected to be reliable. 
That is, it seems reasonable to examine the consistency 
of their scores. 

It is possible to design studies to explore the 
consistency of scores assigned by Test Givers across 
time. Doing so would correspond to a determination of 
test-retest reliability. To do this, one might have Test
Givers assign scores to a number of test responses, wait
a while, and have them do it again. To my knowledge, 
such a study has not been conducted. 
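
A minimal sketch of how such a study might be summarized follows. The item scores are hypothetical, invented only to show the arithmetic, and the names `pearson`, `time_1`, and `time_2` are illustrative. The coefficient is the ordinary Pearson correlation between the scores one Test Giver assigns to the same set of responses on two occasions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 0/1/2-point scores one Test Giver assigned to the same
# eight recorded responses on two occasions some weeks apart.
time_1 = [2, 1, 0, 2, 1, 2, 0, 1]
time_2 = [2, 1, 1, 2, 1, 2, 0, 0]

r_test_retest = pearson(time_1, time_2)
print(f"Test-retest r = {r_test_retest:.2f}")
```

A Test Giver who scored every response identically on both occasions would obtain a coefficient of 1.0; lower values indicate drift in that person's scoring over time.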

A similar approach could be used to explore Test 
Givers' observations of a test taker's behavior during 
the assessment process. This kind of research might 
involve videotaping test administrations, showing the 
tapes to several Test Givers and having them rate the 
test taker's behavior on a behavioral rating scale at two
different points in time. They could also draft a few 
paragraphs describing each test taker's behavior, and a 
number of judges could then evaluate the similarities of 
the descriptive paragraphs. 

If we once again invoke the image of the Test
Givers as independent instruments, a kind of reliability
estimate that mimics parallel or alternate forms takes
shape. Each Test Giver is, after all, thought to be
interchangeable and so imagining them all as parallel
is not difficult. Some research has appeared along
these lines (e.g., Kasper, Throne, & Schulman, 1968;
Oakland, Lee, & Axelrad, 1975). The data in Table 1
also can be examined in light of this kind of reliability.
The instruments (i.e., the Test Givers) all saw the same
video and heard the same responses and had very
similar instruction. Theoretically, they should have
arrived at the same scores.

One could also view the Test Givers collectively, as
a single instrument as previously suggested. If this
were the case, then a kind of internal consistency
measure could be approximated. For example, a
rudimentary split-half reliability might be
accomplished by using an odd/even split and
computing the correlation coefficient between the two
halves. Even without performing the calculation, one
can see that the internal consistency of this instrument
(the collective group of Test Givers) is quite high.
Some research investigating consistency of scoring can
be viewed as addressing internal consistency (e.g.,
Bradley, Hanna, & Lucas, 1980; Miller, Chansky, &
Gredler, 1970; Ryan, Prifitera, & Powers, 1983).
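
One way the odd/even split might be carried out on the Table 1 data is sketched below. It is an illustration under stated assumptions rather than a calculation reported in the article: Test Givers are treated as the "items," the twelve subtests as the cases, and each half is summarized by its mean score on each subtest. The names `SCORES`, `half_means`, and `pearson` are invented here, illegible table cells are inferred from the row's composite scores, and the Spearman-Brown step at the end is the standard correction for halved length, which the text does not mention.

```python
from math import sqrt

# Rows are scorers 1-15 from Table 1; columns are the twelve subtests
# (Inf, Sim, Ari, Voc, Com, DSp, PC, PA, BD, OA, Cdg, Mz).
# Assumption: illegible cells are inferred from the row's composite scores.
SCORES = [
    [12, 15, 10, 13, 13, 10, 14, 14, 13, 13, 8, 12],
    [12, 19, 10, 13, 14, 7, 14, 14, 13, 11, 8, 7],
    [12, 15, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    [12, 15, 10, 12, 12, 10, 9, 14, 13, 11, 8, 8],
    [12, 15, 10, 14, 14, 10, 13, 14, 13, 11, 8, 10],
    [12, 18, 10, 12, 13, 10, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    [12, 18, 10, 13, 13, 10, 14, 14, 14, 11, 8, 12],
    [12, 19, 10, 12, 14, 9, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 13, 13, 10, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 12, 12, 5, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 12, 14, 10, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 12, 13, 10, 14, 14, 13, 11, 8, 12],
    [12, 16, 10, 12, 13, 4, 14, 14, 13, 11, 8, 12],
    [12, 17, 10, 12, 14, 10, 14, 14, 14, 11, 8, 12],
]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def half_means(rows):
    """Mean score on each subtest across the given scorers."""
    return [sum(col) / len(col) for col in zip(*rows)]


odd_half = half_means(SCORES[0::2])   # scorers 1, 3, 5, ..., 15
even_half = half_means(SCORES[1::2])  # scorers 2, 4, 6, ..., 14

r_halves = pearson(odd_half, even_half)
r_full = 2 * r_halves / (1 + r_halves)  # Spearman-Brown correction

print(f"Correlation between halves: {r_halves:.3f}")
print(f"Spearman-Brown estimate:    {r_full:.3f}")
```

Because the subtests differ considerably in level, most of the shared variance between the halves reflects the subject's profile rather than scorer behavior, so a high coefficient here should be read with that in mind.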

Factors Influencing Test Performance 

In numerous studies, a variety of factors have been
suggested to influence the test taker's performance.
For example, performance on intelligence tests has
been explored in relation to such factors as the Test
Giver's sex, age, ethnicity, socioeconomic status,
training and experience, appearance, and personality
characteristics (Anastasi, 1988, p. 38). Significant
findings have emerged for all of these factors at one
time or another, but the soundness of some of these
studies puts their findings in question. Still, these
types of investigations raise questions about how some
of these same factors may affect the Test Giver or
procedures used by him or her during test
administration and scoring (Geisinger & Carlson, in
press). That is, when these factors are going in the
other direction, when they emanate from the test taker
to the Test Giver, what effects, if any, occur?

We should consider that assessments of students
from bilingual or culturally diverse backgrounds may
need to break with tradition somewhat (Rogers, 1993;
Rosado, 1986). Similarly, decisions emanating from
these assessments may need to proceed in a manner that
takes into account environmental factors (Reynolds &
Kaiser, 1990). Perhaps these assessments and
decisions will need to make greater use of observation
and judgment and less use of standardized instruments.
In keeping with this idea, Figueroa (1990) states that
in conducting assessments of such students, there is
"no reason to assume that a judgment call will
contain more error than a psychometric test." But no
one said it will contain less either. At the very least,
Test Givers need to bear in mind what effects culture
may have not only on the test taker's behavior and
performance, but also on the Test Giver's interpretation
of scores and especially of test behavior (American
Psychological Association, 1991; Dana, 1993;
Miller-Jones, 1989; Ogbu, 1988). For example, the Test
Giver should not draw the same conclusions for all test
takers who do not make eye contact or do not converse
readily with him or her, as these factors are likely to be
shaped differently by different cultures. The Test
Giver must be aware of the many influences that a
particular culture or subculture may have on a child's
behavior (Geisinger, 1992).

Conclusions and Recommendations 

Perhaps it is best to think of Test Givers as part of
each instrument they use, rather than as separate
instruments entirely. Or, we could view the Test Giver
as the "master" instrument in charge of the others.

If we adopt the first position, that the Test Giver
becomes a part of whatever instrument he or she uses,
then we would have to acknowledge that when we
administer what is considered to be a standardized
instrument, there is one component of that instrument
that does not stay the same. It would be comparable to
opening a WISC or Stanford-Binet kit and having a
new subtest each time, or at least several new items.

Indeed, during the standardization of most tests,
test developers use or train experts to administer the
test, and review these administrations to eliminate
differences in procedures or interpretations. Typically
these differences are eliminated in advance of the
norming process. In doing so, the test developers are
trying to override the person-to-person differences
inherent in Test Givers who, after all is said and done,
are people. In a very real sense, this procedure
bespeaks the role of the Test Giver as a part of the
instrument. With the kind of close scrutiny that is
given to these Test Givers, differences stemming from
person-to-person variations are considerably reduced.

Of course, most Test Givers do not have the benefit 
of such close scrutiny and validity checks once they 
have completed graduate school. When a new edition 
of a test is developed, some professionals find it 
necessary or desirable to attend training workshops in 
order to be updated on the changes and procedural 
implications of the changes. But many times, this does
not seem necessary or is not within the budget, or is
impossible for some other reason, and psychologists 
end up teaching themselves the revised edition 
(Chattin & Bracken, 1989; Dumont & Faro, 1993). 

Given that Test Givers are part of the assessment
and might even be considered as instruments
themselves, what can we do, in light of the foregoing,
to be better at the task of assessment? What can be
done to improve the psychometric properties of these
instruments? A few suggestions are offered below.

1. Acknowledge that you are a person. You have
traits, physical and personal ones, that enter into the
assessment process, no matter how rigorously your
training program tried to obliterate these. So during
assessments, take note of how test takers interact with
you. That is, note how test takers typically respond to
you: do they view you as the "enemy," as a "friend," as
a "parent," as someone who is going to "uncover a
secret"? If so, then you are bringing that to the testing
situation. And when the test taker's behavior reflects
this trait of yours, you must see the behavior not as
something that belongs to the test taker so much as
belonging to you.

2. Do not review test results of a child referred 
for reevaluation before conducting the assessment.
Although doing so creates an appearance of reliability, 
this form of reliability concerns the inanimate 
instruments primarily and not the Test Givers. At best, 
it supports the reliability of the assessment procedure 
sans the Test Giver. 

3. After collecting the assessment information and
drafting the report, refer to the other report(s) before
finalizing yours. Doing so can, but will not
necessarily, serve as a validity check. Of course, one
would not expect identical assessments to emerge, but
making use of information in the record will highlight
changes that have occurred and might also indicate
those areas in which some double-checking might be in
order. Address differences that appear, hopefully in
light of changes in the subject, rather than errors in the
instruments (that is, the Test Givers).

4. Talk to other instruments (i.e., Test Givers)
regularly. Discuss the scoring of specific items.
Consider exchanging record forms with your
colleagues from time to time in order to check
consistency of your scores. Look for a pattern: do you
score low? high? Are there particular types of tests,
subtests or items where you tend to differ? Resolve
those differences to the maximum extent possible.

5. For those who train Test Givers, consider
including some training on the issue of behavioral
observations, and some activity that relies upon
consensus, such as viewing the same test
administration, scoring it, and writing behavioral
observations.

6. For test developers, acknowledge the role of 
the Test Giver in a forthright manner. Include in 
the test manual reports on typical variations in scores, 
and identify the factors that contribute to these
variations. They are not all random errors. At the
time of test standardization, test developers might also 
sponsor or design research to address the role of the 
Test Giver. 
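
The consistency check suggested in the fourth recommendation can be made concrete with a short sketch. It assumes two colleagues have independently scored the same five record forms; every number and name in it (`mean_signed_difference`, `my_scores`, `colleague_scores`) is hypothetical and serves only to show how a pattern of scoring high or low on particular subtests might be detected.

```python
def mean_signed_difference(mine, theirs):
    """Average of (my score - colleague's score); positive means I score higher."""
    diffs = [m - t for m, t in zip(mine, theirs)]
    return sum(diffs) / len(diffs)


# Hypothetical scaled scores assigned to the same five record forms.
my_scores = {
    "Vocabulary": [13, 11, 9, 14, 10],
    "Similarities": [15, 12, 10, 13, 11],
    "Block Design": [13, 9, 12, 10, 11],
}
colleague_scores = {
    "Vocabulary": [12, 10, 9, 13, 10],
    "Similarities": [15, 12, 10, 13, 11],
    "Block Design": [13, 10, 12, 10, 11],
}

# One signed difference per subtest; values far from zero flag a pattern.
bias = {
    subtest: mean_signed_difference(my_scores[subtest], colleague_scores[subtest])
    for subtest in my_scores
}

for subtest, value in bias.items():
    print(f"{subtest}: {value:+.1f}")
```

In this invented example the first scorer runs about half a point high on Vocabulary and agrees closely elsewhere, which is the kind of subtest-specific difference the recommendation asks colleagues to discuss and resolve.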